1.  Introduction


1.1  Motivations 

The US Army Research Laboratory (ARL) is developing an acoustic system to
detect and localize the launch points and impact points of different kinds of
weapons in the field. Proper identification of the detected events would filter
the incoming information, increase situational awareness, and allow the Soldier
to take appropriate action. For example, the system can provide a warning of
incoming threats, or can cue a radar or a high-resolution camera.

Experimental work over the last decade on the classification of acoustic
signatures collected from ARL acoustic arrays suggests that a classifier can
perform satisfactorily in the environments on which it has been trained.
However, it has difficulty generalizing to different environments. Because of
practical limitations and cost, the collected data cannot cover all possible
atmospheric conditions, terrain geometries, and source-sensor ranges. This
report examines this issue using a variety of classifiers and setups. It makes
one attempt at quantifying the effect of propagation channels. It also points
to other research directions that may get around this fundamental limitation.

1.2  Sparsity-Based  Representation  for  Transient  Acoustic  Signals 

Over the last decade, sparse signal representations have proven to be extremely
powerful tools for solving many inverse problems, where sparsity acts as a
strong prior that alleviates the ill-posed nature of the problem. Recent
research has pointed out that sparse representation is also useful for
discriminative applications.1-3 These applications rely on the crucial
observation that test samples in the same class usually lie in a
low-dimensional subspace of some proper bases or dictionaries. Thus, if the
dictionary is constructed from all the training samples in all the classes, a
test sample can be sparsely represented by only a few columns of this
dictionary. Therefore, the sparse coefficient vector, which is recovered
efficiently via $\ell_1$-minimization techniques, can naturally be considered
as the discriminative factor. In Wright et al.,2 the authors successfully
applied this idea to the face recognition problem. Since then, many more
sophisticated techniques have been developed and applied to various fields,
such as hyperspectral target detection,4 chemical plume detection and
classification,5 and visual classification.1,6,7

Nowadays, many real-world problems involve simultaneous representations of
multiple correlated signals. These applications normally face the scenario
where data are sampled simultaneously from multiple colocated sources (such as
multiple channels or sensors) that record the same physical event within a
small spatio-temporal neighborhood. This data collection scenario allows
exploitation of correlated features within the signal sources to improve the
resulting signal representation and guide successful decision making. Joint
sparsity models, which assume that multiple measurements belonging to the same
class can be simultaneously sparsely represented by a few common training
samples in the dictionaries, have been successfully applied in many
applications. For instance, a joint sparse representation (JSR)-based method is
proposed in Chen et al.8 for target detection in hyperspectral imagery. The
model exploits the fact that the sparse representations of hyperspectral pixels
in a small neighborhood tend to share common sparsity patterns. Yuan and Yan7
investigated a multitask model for visual classification, which also assumes
that multiple observations from the same class can be simultaneously
represented by a few columns of the training dictionary. Similarly, Nguyen et
al.9 investigated a multisensor classification framework via a multivariate
sparse representation that forces different recording sensors to share the same
sparse support distributions on their coefficient vectors.

In this report, we first develop a JSR model that imposes row-sparsity
constraints across multiple acoustic sensors to collaboratively classify
transient acoustic signals. Furthermore, we robustify our models to deal with
the presence of large and dense but correlated signal interference/noise
(namely, low-rank interference). This scenario is normally observed when the
recorded data are the superposition of target signals and interference, which
can be signals from external sources, effects of the propagation channel on
multiple sensors, the underlying background that is inherently anchored in the
data, or any pattern noise that remains stationary during signal collection.
This interference normally has a correlated structure and appears as a low-rank
component: because the sensors are spatially colocated and the data samples are
recorded at the same time, any interference from external sources affects all
of the sensor measurements similarly. Another extension of our sparsity-based
representation models is the incorporation of group-structured sparsity
constraints among the observations of multiple sensors. The group-sparse
constraint is enforced concurrently with the row-sparse constraint on the
support coefficients of all sensors, thus yielding one more layer of
classification robustness.

1.3  Contributions


The  main  contributions  of  this  technical  report  are  as  follows: 

•  We develop a variety of novel sparsity-regularized regression methods that
effectively incorporate simultaneous structured-sparsity constraints,
expressed via a row-sparse and/or group-sparse coefficient matrix, across
multiple acoustic sensors. We also robustify our models to deal with the
presence of a large but low-rank interference term. While the row-sparsity
constraint in the JSR model has been extensively studied and applied in many
disciplines, the models that incorporate low-rank interference and that
integrate both joint- and group-sparsity structure are our own developments.
Preliminary classification results of these methods for a subset of the ARL
acoustic transients data set were presented as part of our paper, which will
soon be published in IEEE Transactions on Signal Processing.10

•  We report comprehensive comparisons of the classification performance of
12 different classifiers, including 6 conventional classifiers previously
developed and examined at ARL, 4 newly developed sparsity-based representation
models, and 2 emerging deep learning architectures.

•  The classification performance of all competing methods is examined on a
variety of experimental setups. Two broad sets of classification problems are
considered: a 4-class problem (launch and impact of 2 projectiles) and a
6-class problem (launch and impact of 3 projectiles). Furthermore,
classification accuracy is evaluated for different partitionings of the
training and testing sets, effectively showing the performance degradation due
to the propagation of the signals through the environment. In addition,
classification rates are summarized and analyzed for both weighted and
non-weighted classification results. Experimental setups and empirical results
are discussed and analyzed in detail in Section 4 of the report.

1.4  Notations  and  Outline 

The following notational conventions are used throughout this report. We denote
vectors by boldface lowercase letters, such as $\mathbf{x}$, and matrices by
boldface uppercase letters, such as $\mathbf{X}$. For a matrix $\mathbf{X}$,
$X_{i,j}$ represents the element in the $i$th row and $j$th column of
$\mathbf{X}$, while a bold lowercase letter with a subscript, such as
$\mathbf{x}_j$, represents its $j$th column. The $\ell_q$-norm of a vector
$\mathbf{x} \in \mathbb{R}^N$ is defined as
$\|\mathbf{x}\|_q = \left(\sum_{i=1}^{N} |x_i|^q\right)^{1/q}$, where $x_i$ is
the $i$th element of $\mathbf{x}$. Given a matrix
$\mathbf{X} \in \mathbb{R}^{N \times M}$, $\|\mathbf{X}\|_F$,
$\|\mathbf{X}\|_{1,q}$, and $\|\mathbf{X}\|_*$ denote its Frobenius norm, mixed
$\ell_{1,q}$-norm, and nuclear norm, respectively.

The remainder of this report is organized as follows. In Section 2, we give a
brief overview of sparse representation for classification. Section 3
introduces the proposed sparsity models, which are based on different
assumptions on the structure of the coefficient vectors and of the low-rank
noise/interference. That section also outlines a fast and efficient algorithm
based on the alternating direction method of multipliers (ADMM)11 for solving
the convex optimization problems that arise from these models, together with
its guarantee of convergence to the optimal solutions. Extensive experiments
are reported in Section 4, and conclusions and future work are presented in
Section 5.


2.  Background  Reviews 


2.1  Sparse  Representation  for  Classification 


Recent years have witnessed the development of sparse representation techniques
for both signal recovery and classification. In classification, to make the
problem tractable, one often assumes that all of the samples that belong to the
same class lie approximately in the same low-dimensional subspace.2 Suppose we
are given a dictionary representing $C$ distinct classes,
$\mathbf{D} = [\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_C] \in \mathbb{R}^{N \times P}$,
where $N$ is the feature dimension of each sample and the $c$-th class
subdictionary $\mathbf{D}_c$ has $P_c$ training samples
$\{\mathbf{d}_{c,p}\}_{p=1,\ldots,P_c}$, resulting in a total of
$P = \sum_{c=1}^{C} P_c$ samples in the dictionary $\mathbf{D}$. To label a
test sample $\mathbf{y} \in \mathbb{R}^N$, it is often assumed that
$\mathbf{y}$ can be represented by a subset of the training samples in
$\mathbf{D}$. Mathematically, $\mathbf{y}$ is written as

$$
\mathbf{y} = [\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_C]
\begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_C \end{pmatrix}
+ \mathbf{n} = \mathbf{D}\mathbf{a} + \mathbf{n}, \tag{1}
$$

where $\mathbf{a} \in \mathbb{R}^P$ is the unknown coefficient vector, and
$\mathbf{n}$ is low-energy noise that is due to the imperfection of the test
sample and has little effect on the classification decision. For simplicity,
$\mathbf{n}$ is omitted from all model descriptions, though it is still taken
into consideration by the fidelity constraint (a penalty term with the
Frobenius norm) in the optimization process.


The sparsity assumption implies that only a few coefficients of $\mathbf{a}$
are nonzero and most of the others are insignificant. In particular, only the
entries of $\mathbf{a}$ that are associated with the class of the test sample
$\mathbf{y}$ are nonzero, and thus $\mathbf{a}$ is a sparse vector. Taking this
prior into account, many methods have been proposed to find the coefficient
vector $\mathbf{a}$ efficiently, including $\ell_1$-norm minimization,12 greedy
pursuit (e.g., CoSaMP13 or subspace pursuit14), and iterative hard
thresholding,15 to name a few. In this report, we favor the
$\ell_1$-minimization approach, which is described as follows:

$$
\min_{\mathbf{a}} \|\mathbf{a}\|_1
\quad \text{s.t.} \quad \mathbf{y} = \mathbf{D}\mathbf{a}. \tag{2}
$$


Once the coefficient vector $\mathbf{a}$ is obtained from Eq. 2, the next step
is to assign the test sample $\mathbf{y}$ to a class label. This can be done by
simply taking the minimal residual between $\mathbf{y}$ and its approximation
from each class subdictionary:

$$
\mathrm{Class}(\mathbf{y}) = \operatorname*{argmin}_{c=1,\ldots,C}
\|\mathbf{y} - \mathbf{D}_c \mathbf{a}_c\|_2, \tag{3}
$$

where $\mathbf{a}_c$ is the vector induced by keeping only the coefficients
corresponding to the $c$-th class in $\mathbf{a}$. This step can be interpreted
as assigning the class label of $\mathbf{y}$ to the class that can best
represent $\mathbf{y}$. In the case where a tested signal is the superposition
of 2 or more classes, the class assignment step can be modified from Eq. 3 by
considering whether the residual between $\mathbf{y}$ and the closest
representation with respect to 2 class subdictionaries is smaller than a
certain threshold.
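The two steps above (sparse coding, then minimal-residual labeling) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the solver is ISTA (iterative soft-thresholding) applied to the unconstrained form $\min_{\mathbf{a}} \frac{1}{2}\|\mathbf{y} - \mathbf{D}\mathbf{a}\|_2^2 + \lambda\|\mathbf{a}\|_1$, which stands in for the equality-constrained Eq. 2 and is not necessarily the solver used in this report. The function names, the `labels` array mapping each atom to its class, the value of `lam`, and the random toy dictionary are all hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding: shrink magnitudes by t, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_l1(y, D, lam=0.01, n_iter=500):
    """Approximately solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / largest eigenvalue of D^T D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)                # gradient of the quadratic term
        a = soft_threshold(a - step * grad, step * lam)
    return a

def classify_min_residual(y, D, a, labels):
    """Eq. 3: pick the class whose atoms and coefficients best reconstruct y."""
    labels = np.asarray(labels)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c                      # atoms belonging to class c
        residuals[int(c)] = float(np.linalg.norm(y - D[:, mask] @ a[mask]))
    return min(residuals, key=residuals.get), residuals

# Toy dictionary: 2 classes, 3 unit-norm atoms each (random, illustration only).
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 6))
D /= np.linalg.norm(D, axis=0)
labels = [0, 0, 0, 1, 1, 1]
y = 0.8 * D[:, 3] + 0.6 * D[:, 5]               # test sample built from class-1 atoms
a = ista_l1(y, D)
label, residuals = classify_min_residual(y, D, a, labels)
```

In this toy case the test sample lies in the span of the class-1 atoms, so the class-1 residual is the smaller of the two.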

2.2  Joint  Sparse  Representation 

Single-measurement sparse representation has been shown to be efficient for
classification tasks because it provides an effective way to approximate the
test sample from the training examples. However, in many practical
applications, we are often given a set of test measurements collected from
different observations of the same physical event. An obvious question is how
to simultaneously exploit the information from various sources to come up with
a more precise classification decision, rather than classifying each test
sample independently and then assigning a class label via a simple fusion
(e.g., a voting scheme). An active line of research recently focuses on
answering this question using JSR.16-19 Mathematically, given an unlabeled set
of $M$ test samples
$\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M] \in \mathbb{R}^{N \times M}$
from different nearby spatio-temporal observations, we again assume that each
measurement $\mathbf{y}_m$ can be compactly represented by a few atoms in the
training dictionary:

$$
\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M]
= [\mathbf{D}\mathbf{a}_1, \mathbf{D}\mathbf{a}_2, \ldots, \mathbf{D}\mathbf{a}_M]
= \mathbf{D}\mathbf{A}, \tag{4}
$$

where
$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M] \in \mathbb{R}^{P \times M}$
is the unknown coefficient matrix. In the joint sparsity model, the sparse
coefficient vectors $\{\mathbf{a}_m\}_{m=1}^{M}$ share the same support
$\Gamma$, and thus the matrix $\mathbf{A}$ is a row-sparse matrix with only
$|\Gamma|$ nonzero rows. This model is the extension of the aforementioned
sparse representation for classification model to multiple observations. It has
been shown to improve classification in various practical applications7,16,17
and to reduce the sample size needed for signal reconstruction
applications18,20 when the row-sparsity assumption holds.

To recover the row-sparse matrix $\mathbf{A}$, the following joint sparse
optimization is proposed:

$$
\min_{\mathbf{A}} \|\mathbf{A}\|_{1,q}
\quad \text{s.t.} \quad \mathbf{Y} = \mathbf{D}\mathbf{A}, \tag{5}
$$

where the norm $\|\mathbf{A}\|_{1,q}$ with $q > 1$, defined as
$\|\mathbf{A}\|_{1,q} = \sum_{i=1}^{P} \|\mathbf{a}_{i,:}\|_q$ with
$\mathbf{a}_{i,:}$ being the rows of the matrix $\mathbf{A}$, encourages shared
sparsity patterns across multiple sensors. This norm can be phrased as
performing an $\ell_q$-norm across the rows to enforce the "joint" property,
followed by an $\ell_1$-norm along the columns to enforce "sparsity". This
regularization norm therefore encourages shared sparsity patterns across
related observations, and the columns of the solution of the optimization in
Eq. 5 share a common support.
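A small numerical illustration of why the mixed norm favors shared supports is given below. It is a sketch only: the helper name `mixed_norm` and the two toy matrices are assumptions made for this example.

```python
import numpy as np

def mixed_norm(A, q=2):
    """l_{1,q} norm: l_q norm of each row, then the sum over rows."""
    return float(np.sum(np.linalg.norm(A, ord=q, axis=1)))

# Two 4x2 coefficient matrices with the same four entries of magnitude 1.
row_sparse = np.array([[1.0, 1.0],
                       [0.0, 0.0],
                       [1.0, 1.0],
                       [0.0, 0.0]])   # both columns share the support {row 0, row 2}
scattered = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0],
                      [0.0, 1.0]])    # the columns use disjoint supports

norm_shared = mixed_norm(row_sparse)      # 2 * sqrt(2)
norm_scattered = mixed_norm(scattered)    # 1 + 1 + 1 + 1 = 4
```

With $q = 2$ the matrix whose columns share a support has the smaller norm, so minimizing $\|\mathbf{A}\|_{1,q}$ prefers it. With $q = 1$ both matrices have norm 4 and the coupling between columns disappears, which is consistent with the requirement $q > 1$.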

3.  Robust  Structural  Sparsity-Based  Representation  for 
Classification  of  Transient  Acoustic  Signals 

3.1  Joint  Sparsity  Model  for  Classification  of  Acoustic  Transients 

In this section, we discuss a general JSR framework for the classification of
transient acoustic signals. In this application, the acoustic data are
collected during the launch and impact of different types of munitions using a
tetrahedral acoustic sensor array; hence, there are 4 measurements for each
data sample (i.e., $M = 4$). A feature vector (e.g., cepstral features, which
have proved effective in speech recognition and acoustic signal classification)
is then extracted from each measurement, and the feature vectors are
concatenated into a matrix $\mathbf{Y}$, as shown in Fig. 1. Since the 4
sensors simultaneously measure the same event, the coefficient matrix
$\mathbf{A}$ in the linear representation over the training samples in the
dictionary $\mathbf{D} = [\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_C]$
should have common sparse supports. In other words, $\mathbf{A}$ should be
row-sparse. After the joint sparsity optimization is solved for $\mathbf{Y}$,
its class label is determined by the following minimal residual rule:

$$
\mathrm{Class}(\mathbf{Y}) = \operatorname*{argmin}_{c=1,\ldots,C}
\left\| \mathbf{Y} - \mathbf{D}_c \mathbf{A}_c \right\|_F^2, \tag{6}
$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, and $\mathbf{D}_c$ and
$\mathbf{A}_c$ are the submatrices associated with the $c$-th class of the
training dictionary $\mathbf{D}$ and of the coefficient matrix $\mathbf{A}$
solved by Eq. 5, respectively. This step can be interpreted as collaboratively
assigning the class label of $\mathbf{Y}$ to the class that can best represent
all samples in $\mathbf{Y}$.
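The joint decision rule of Eq. 6 can be sketched as follows. The snippet assumes that a row-sparse coefficient matrix has already been obtained from Eq. 5 by some solver (not shown); the function name `classify_joint`, the `labels` array, and the synthetic data are hypothetical.

```python
import numpy as np

def classify_joint(Y, D, A, labels):
    """Eq. 6: one label for all columns of Y via class-wise Frobenius residuals."""
    labels = np.asarray(labels)
    residuals = {}
    for c in np.unique(labels):
        rows = labels == c                      # atoms (rows of A) of class c
        diff = Y - D[:, rows] @ A[rows, :]
        residuals[int(c)] = float(np.linalg.norm(diff, 'fro') ** 2)
    return min(residuals, key=residuals.get), residuals

rng = np.random.default_rng(1)
D = rng.standard_normal((10, 6))                # 6 atoms, 3 per class
labels = [0, 0, 0, 1, 1, 1]

# Row-sparse coefficients: the 4 sensors share the same two class-0 atoms.
A = np.zeros((6, 4))
A[0, :] = [1.0, 0.9, 1.1, 1.0]
A[2, :] = [0.5, 0.4, 0.6, 0.5]
Y = D @ A + 0.01 * rng.standard_normal((10, 4))  # small additive noise

label, residuals = classify_joint(Y, D, A, labels)
```

Because the residual is accumulated over all 4 columns at once, a single sensor with a poor fit cannot flip the decision on its own, which is the practical difference from labeling each column separately and voting.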


Fig.  1  Joint  sparsity  framework  for  classification  of  transient  acoustic  signals 


3.2  Joint  Sparse  Representation  with  Low-Rank  Interference 

In this section, we study a robust joint-sparsity model that is capable of
coping with dense and large but correlated noise, termed low-rank interference.
This scenario often happens when there are external sources interfering with
the recording process of all sensors. When the sensors are closely spaced,
interference sources look similar across the sensors, resulting in a large but
low-rank corruption. Figure 2 illustrates a typical setup with multiple
colocated sensors simultaneously recording the same physical events. In a
multi-model setting, sensor colocation normally ensures that interference/noise
patterns are very similar, hence justifying the low-rank assumption. These
interference sources may include sound and vibration from a car passing by, a
helicopter hovering nearby, interference from any radio-frequency source, or
the effect of propagation on the signal (in the cepstral space). Furthermore,
in many situations the recorded data may contain not only the signal of
interest but also an intrinsic background that normally stays stationary, hence
producing a low-rank background interference. Our proposed joint sparse
representation with low-rank interference (JSR+L) model is expected to tackle
this problem by extracting the low-rank component while collaboratively taking
advantage of correlated information from different sensors.

Approved for public release; distribution is unlimited.


Fig.  2  A  general  multisensor  problem  with  unknown  low-rank  interference 

Mathematically, let $\mathbf{Y}$ be the measurement matrix. Under some
circumstances, we are not able to observe the joint sparse representation
$\mathbf{D}\mathbf{A}$ in $\mathbf{Y}$ directly; instead, we observe its
corrupted version $\mathbf{Y} = \mathbf{L} + \mathbf{D}\mathbf{A}$. The matrix
$\mathbf{L}$ captures the interference, with the prior knowledge that
$\mathbf{L}$ is low-rank. To separate $\mathbf{L}$ and $\mathbf{D}\mathbf{A}$,
a model that simultaneously fits a low-rank approximation to $\mathbf{L}$ and
applies a joint sparse $\ell_{1,q}$-regularization to the coefficient matrix
$\mathbf{A}$ is proposed as

$$
\min_{\mathbf{A}, \mathbf{L}} \|\mathbf{A}\|_{1,q} + \lambda_L \|\mathbf{L}\|_*
\quad \text{s.t.} \quad \mathbf{Y} = \mathbf{D}\mathbf{A} + \mathbf{L}, \tag{7}
$$

where the nuclear norm $\|\mathbf{L}\|_*$, defined as the sum of all singular
values of the matrix $\mathbf{L}$, is a convex relaxation of the rank,21,22 and
$\lambda_L > 0$ is a weighting parameter balancing the 2 regularization terms.
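To make the low-rank assumption concrete, the sketch below compares interference that is identical on all 4 sensors with independent per-sensor noise of the same energy. The dimensions, the random data, and the helper name `nuclear_norm` are assumptions made for this illustration.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of singular values, the convex surrogate for rank used in Eq. 7."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

rng = np.random.default_rng(0)
N, M = 12, 4                                   # feature dimension, number of sensors

# Interference seen identically by all 4 sensors: a rank-1 matrix.
common = rng.standard_normal(N)
L_shared = np.outer(common, np.ones(M))

# Independent noise on each sensor, rescaled to the same Frobenius norm.
L_indep = rng.standard_normal((N, M))
L_indep *= np.linalg.norm(L_shared, 'fro') / np.linalg.norm(L_indep, 'fro')

rank_shared = int(np.linalg.matrix_rank(L_shared))
rank_indep = int(np.linalg.matrix_rank(L_indep))
nuc_shared = nuclear_norm(L_shared)
nuc_indep = nuclear_norm(L_indep)
```

The shared interference has rank 1 and its nuclear norm equals its Frobenius norm, whereas the independent noise of equal energy has full column rank and a larger nuclear norm. The penalty $\lambda_L \|\mathbf{L}\|_*$ therefore makes it cheaper to assign correlated components to $\mathbf{L}$ than uncorrelated ones.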

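Once the solution $\{\hat{\mathbf{A}}, \hat{\mathbf{L}}\}$ of Eq. 7 is
computed, the class label of $\mathbf{Y}$ is decided by the minimal residual
rule

$$
\mathrm{Class}(\mathbf{Y}) = \operatorname*{argmin}_{c=1,\ldots,C}
\left\| \mathbf{Y} - \mathbf{D}_c \hat{\mathbf{A}}_c - \hat{\mathbf{L}}_c \right\|_F^2, \tag{8}
$$

where $\mathbf{D}_c$, $\hat{\mathbf{A}}_c$, and $\hat{\mathbf{L}}_c$ are the
corresponding induced matrices associated with the $c$-th class.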

3.3  Simultaneous  Group-and-Joint  Sparse  Representation  with 
Low-Rank  Interference 

JSR+L has the capability to extract correlated noise/interference while
simultaneously exploiting the correlation among multiple sensors at the row
level of the coefficient matrix, hence boosting the overall classification
result. This model can be strengthened further by incorporating a group
sparsity constraint into the coefficient matrix $\mathbf{A}$. The idea of
adding group structure has been intensively studied23,24 and has been shown,
both theoretically and empirically, to better represent signals for
discriminative purposes. This concept is especially beneficial for
classification tasks where multiple measurements do not necessarily represent
the same signals but rather come from the same set of classes. This leads to a
group sparse representation where the dictionary atoms are grouped and the
sparse coefficients are forced to have only a few active groups at a time,
resulting in a 2-level sparsity model: group-sparse and row-sparse in a
combined cost function.

We tentatively apply this concept to the JSR+L model. The new model
simultaneously searches for the group-and-row sparse structure representation
among all sensors and for the low-rank interference, and is termed
group-and-joint sparse representation with low-rank interference (GJSR+L):

$$
\min_{\mathbf{A}, \mathbf{L}} \|\mathbf{A}\|_{1,q}
+ \lambda_G \sum_{c=1}^{C} \|\mathbf{A}_c\|_F + \lambda_L \|\mathbf{L}\|_*
\quad \text{s.t.} \quad \mathbf{Y} = \mathbf{D}\mathbf{A} + \mathbf{L}, \tag{9}
$$

where $\mathbf{A}_c$ is the subcoefficient matrix of $\mathbf{A}$ induced by
the labeled indexes corresponding to class $c$, and $\lambda_G > 0$ is the
weighting parameter of the group constraint. The optimization in Eq. 9 can be
interpreted as follows: the first term, $\|\mathbf{A}\|_{1,q}$, encourages
row-wise sparsity across all sensors, whereas the group regularizer defined by
the second term tends to minimize the number of active groups in the same
coefficient matrix $\mathbf{A}$. The third term penalizes the nuclear norm of
the interference, as discussed in the previous section. Together, the terms
force $\mathbf{A}$ to be both group-sparse and row-sparse while the low-rank
interference that appears in all measurements is extracted. Figure 3 shows a
visual comparison of the coefficient matrices of the joint-sparsity structure
and of the combined group-and-joint sparsity model. Once the coefficient matrix
and the low-rank term are recovered, the class label of $\mathbf{Y}$ is decided
by Eq. 8, as in the JSR+L method.

Fig. 3  a) Coefficient matrix of joint sparsity model and b) coefficient matrix
of group-and-joint sparsity model

3.4  Algorithm 

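In this section, we provide a fast and efficient algorithm based on the ADMM11
to solve the proposed multisensor sparsity-based representation models. GJSR+L
is the most general method; hence, we discuss the algorithm that solves Eq. 9
and then simplify it to generate solutions for the other methods. The augmented
Lagrangian function of Eq. 9 is defined as

$$
\mathcal{L}(\mathbf{A}, \mathbf{L}, \mathbf{Z}) = \|\mathbf{A}\|_{1,q}
+ \lambda_G \sum_{c=1}^{C} \|\mathbf{A}_c\|_F + \lambda_L \|\mathbf{L}\|_*
+ \langle \mathbf{Y} - \mathbf{D}\mathbf{A} - \mathbf{L}, \mathbf{Z} \rangle
+ \frac{\mu}{2} \|\mathbf{Y} - \mathbf{D}\mathbf{A} - \mathbf{L}\|_F^2, \tag{10}
$$

where $\mathbf{Z}$ is the Lagrangian multiplier, and $\mu$ is a positive
penalty parameter. The algorithm, formally presented in Algorithm 1, minimizes
$\mathcal{L}(\mathbf{A}, \mathbf{L}, \mathbf{Z})$ with respect to one variable
at a time while keeping the others fixed, and updates the variables
sequentially.

Inputs: Matrices $\mathbf{Y}$ and $\mathbf{D}$, weighting parameters
$\lambda_G$ and $\lambda_L$.

Initializations: $\mathbf{A}_0 = \mathbf{0}$, $\mathbf{L}_0 = \mathbf{0}$, $j = 0$.

While not converged do

1. Solve for $\mathbf{L}_{j+1}$: $\mathbf{L}_{j+1} = \operatorname*{argmin}_{\mathbf{L}} \mathcal{L}(\mathbf{A}_j, \mathbf{L}, \mathbf{Z}_j)$

2. Solve for $\mathbf{A}_{j+1}$: $\mathbf{A}_{j+1} = \operatorname*{argmin}_{\mathbf{A}} \mathcal{L}(\mathbf{A}, \mathbf{L}_{j+1}, \mathbf{Z}_j)$

3. Update the multiplier: $\mathbf{Z}_{j+1} = \mathbf{Z}_j + \mu(\mathbf{Y} - \mathbf{D}\mathbf{A}_{j+1} - \mathbf{L}_{j+1})$

4. $j = j + 1$.

end while

Outputs: $(\hat{\mathbf{A}}, \hat{\mathbf{L}}) = (\mathbf{A}_j, \mathbf{L}_j)$.

Algorithm 1.  ADMM for GJSR+L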

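Algorithm 1 involves 2 main subproblems, which solve the intermediate
minimizations with respect to the variables $\mathbf{L}$ and $\mathbf{A}$ at
each iteration $j$. The first subproblem, which updates $\mathbf{L}$, can be
recast as

$$
\begin{aligned}
\mathbf{L}_{j+1} &= \operatorname*{argmin}_{\mathbf{L}} \lambda_L \|\mathbf{L}\|_*
+ \frac{\mu}{2} \left\| \mathbf{L} - \left(\mathbf{Y} - \mathbf{D}\mathbf{A}_j + \frac{1}{\mu}\mathbf{Z}_j\right) \right\|_F^2 \\
&= \operatorname*{argmin}_{\mathbf{L}} \lambda_L \|\mathbf{L}\|_*
+ \frac{\mu}{2} \|\mathbf{L} - \mathbf{P}_j\|_F^2,
\end{aligned} \tag{11}
$$

where we define
$\mathbf{P}_j = \mathbf{Y} - \mathbf{D}\mathbf{A}_j + \frac{1}{\mu}\mathbf{Z}_j$.
The proximal minimization in Eq. 11 can be solved via the singular value
thresholding operator,21 in which we first compute the singular value
decomposition
$(\mathbf{U}, \mathbf{\Lambda}, \mathbf{V}) = \mathrm{svd}(\mathbf{P}_j)$. The
intermediate solution $\mathbf{L}_{j+1}$ is then determined by applying the
soft-thresholding operator to the singular values:
$\mathbf{L}_{j+1} = \mathbf{U}\, \mathcal{S}_{\lambda_L/\mu}(\mathbf{\Lambda})\, \mathbf{V}^T$,
where the soft-thresholding operator of $\mathbf{\Lambda}$ over
$\lambda_L/\mu$ is defined element-wise for each $\delta$ on the diagonal of
$\mathbf{\Lambda}$ as
$\mathcal{S}_{\lambda_L/\mu}(\delta) = \max(|\delta| - \lambda_L/\mu,\, 0)\, \mathrm{sgn}(\delta)$.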
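The update of $\mathbf{L}$ in Eq. 11 amounts to a few lines of NumPy. The sketch below is illustrative: the function name `svt`, the threshold value, and the toy input are assumptions, and `tau` plays the role of $\lambda_L/\mu$.

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding: argmin_L tau*||L||_* + 0.5*||L - P||_F^2."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)        # singular values are nonnegative
    return (U * s_shrunk) @ Vt                 # rescale columns of U, then recombine

# Toy input: a rank-1 matrix shared by 4 sensors plus small dense noise.
rng = np.random.default_rng(0)
P = np.outer(rng.standard_normal(8), np.ones(4)) + 0.01 * rng.standard_normal((8, 4))
L_next = svt(P, tau=0.5)

s_before = np.linalg.svd(P, compute_uv=False)
s_after = np.linalg.svd(L_next, compute_uv=False)
```

In this example the small singular values contributed by the noise fall below the threshold and are set to zero, so the result has rank 1, while the dominant singular value is reduced by exactly `tau`.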

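The second subproblem, which updates $\mathbf{A}$, can be rewritten as

$$
\mathbf{A}_{j+1} = \operatorname*{argmin}_{\mathbf{A}} \|\mathbf{A}\|_{1,q}
+ \lambda_G \sum_{c=1}^{C} \|\mathbf{A}_c\|_F
+ \frac{\mu}{2} \left\| \mathbf{D}\mathbf{A} - \left(\mathbf{Y} - \mathbf{L}_{j+1} + \frac{1}{\mu}\mathbf{Z}_j\right) \right\|_F^2. \tag{12}
$$

The objective of this subproblem is convex. Unfortunately, its closed-form
solution is not easily determined. The difficulties come not only from the
joint row-sparse and group-sparse regularization on the variable $\mathbf{A}$,
but also from the dictionary transformation $\mathbf{D}\mathbf{A}$ and the
engagement of multiple modalities. To tackle these difficulties, we do not
solve Eq. 12 exactly. Instead, the third term in the objective function is
approximated by its Taylor series expansion at $\mathbf{A}_j$ (obtained from
iteration $j$) up to the second derivative order:

$$
\begin{aligned}
\left\| \mathbf{D}\mathbf{A} - \left(\mathbf{Y} - \mathbf{L}_{j+1} + \frac{1}{\mu}\mathbf{Z}_j\right) \right\|_F^2
\approx{} & \left\| \mathbf{D}\mathbf{A}_j - \left(\mathbf{Y} - \mathbf{L}_{j+1} + \frac{1}{\mu}\mathbf{Z}_j\right) \right\|_F^2 \\
& + 2 \langle \mathbf{A} - \mathbf{A}_j, \mathbf{T}_j \rangle
+ \frac{1}{\theta} \|\mathbf{A} - \mathbf{A}_j\|_F^2,
\end{aligned} \tag{13}
$$

where $\theta$ is a positive proximal parameter and
$\mathbf{T}_j = \mathbf{D}^T \left( \mathbf{D}\mathbf{A}_j - \left(\mathbf{Y} - \mathbf{L}_{j+1} + \frac{1}{\mu}\mathbf{Z}_j\right) \right)$
is the gradient at $\mathbf{A}_j$ of the expansion. The first component on the
right-hand side of Eq. 13 is constant with respect to $\mathbf{A}$.
Consequently, by substituting Eq. 13 into the subproblem in Eq. 12 and
combining the last 2 terms of Eq. 13 into one component, the optimization that
updates $\mathbf{A}$ can be simplified to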

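$$
\mathbf{A}_{j+1} = \operatorname*{argmin}_{\mathbf{A}} \|\mathbf{A}\|_{1,q}
+ \lambda_G \sum_{c=1}^{C} \|\mathbf{A}_c\|_F
+ \frac{\mu}{2\theta} \left\| \mathbf{A} - (\mathbf{A}_j - \theta \mathbf{T}_j) \right\|_F^2. \tag{14}
$$

The derivation in the second line of Eq. 14 is again based on the separable
structure of $\|\cdot\|_F^2$ with
$\mathbf{T}_j = [\mathbf{T}_j^1, \mathbf{T}_j^2, \ldots, \mathbf{T}_j^M]$. Note
that while $\|\cdot\|_F^2$ has an element-wise separable structure, and hence
is both row- and column-separable, the norm $\|\cdot\|_F$ has no separable
structure, and $\|\cdot\|_{1,q}$ is separable with respect to rows (i.e.,
$\|\mathbf{A}\|_{1,q} = \sum_{c=1}^{C} \|\mathbf{A}_c\|_{1,q}$, with
$\mathbf{A}$ being the row-concatenation matrix of all $\mathbf{A}_c$'s).
Applying this row-separable property to the first and third terms of Eq. 14, we
can further simplify it to solve for the subcoefficient matrix of each class
separately:

$$
(\mathbf{A}_{j+1})_c = \operatorname*{argmin}_{\mathbf{A}_c} \|\mathbf{A}_c\|_{1,q}
+ \lambda_G \|\mathbf{A}_c\|_F
+ \frac{\mu}{2\theta} \left\| \mathbf{A}_c - \left((\mathbf{A}_j)_c - \theta (\mathbf{T}_j)_c\right) \right\|_F^2
\quad (\forall c = 1, 2, \ldots, C). \tag{15}
$$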

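The explicit solution of Eq. 15 can then be obtained via the following lemma.

Lemma 1: Given a matrix $\mathbf{R}$, the optimal solution to

$$
\min_{\mathbf{X}} \alpha_1 \|\mathbf{X}\|_{1,q} + \alpha_2 \|\mathbf{X}\|_F
+ \frac{1}{2} \|\mathbf{X} - \mathbf{R}\|_F^2 \tag{16}
$$

is the matrix $\hat{\mathbf{X}}$

$$
\hat{\mathbf{X}} =
\begin{cases}
\dfrac{\|\mathbf{S}\|_F - \alpha_2}{\|\mathbf{S}\|_F}\, \mathbf{S} & \text{if } \|\mathbf{S}\|_F > \alpha_2 \\
\mathbf{0} & \text{otherwise,}
\end{cases} \tag{17}
$$

where the $i$-th row of $\mathbf{S}$ is given by

$$
\mathbf{S}_{i,:} =
\begin{cases}
\dfrac{\|\mathbf{R}_{i,:}\|_q - \alpha_1}{\|\mathbf{R}_{i,:}\|_q}\, \mathbf{R}_{i,:} & \text{if } \|\mathbf{R}_{i,:}\|_q > \alpha_1 \\
\mathbf{0} & \text{otherwise.}
\end{cases} \tag{18}
$$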
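The two-stage shrinkage of Lemma 1 can be sketched as below for the case $q = 2$, which is an assumption of this example rather than a restriction stated in the report. The function name `prox_group_row` and the toy matrix are hypothetical.

```python
import numpy as np

def prox_group_row(R, alpha1, alpha2):
    """Lemma 1 with q = 2: row-wise shrinkage, then shrinkage of the whole block."""
    # Stage 1 (Eq. 18): shrink each row of R by alpha1 in its l_2 norm.
    row_norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(row_norms - alpha1, 0.0) / np.maximum(row_norms, 1e-12)
    S = scale * R
    # Stage 2 (Eq. 17): shrink the whole matrix S by alpha2 in Frobenius norm.
    s_norm = np.linalg.norm(S, 'fro')
    if s_norm <= alpha2:
        return np.zeros_like(R)
    return (s_norm - alpha2) / s_norm * S

R = np.array([[3.0, 4.0],      # row norm 5: survives stage 1, scaled by 4/5
              [0.1, 0.1],      # row norm below alpha1: set to zero
              [0.0, 0.0]])
X = prox_group_row(R, alpha1=1.0, alpha2=1.0)
# Stage 1 gives [[2.4, 3.2], [0, 0], [0, 0]] with Frobenius norm 4;
# stage 2 scales it by 3/4, so X = [[1.8, 2.4], [0, 0], [0, 0]].
```

To apply the lemma to Eq. 15, divide the objective by $\mu/\theta$, which gives $\mathbf{R} = (\mathbf{A}_j)_c - \theta(\mathbf{T}_j)_c$, $\alpha_1 = \theta/\mu$, and $\alpha_2 = \lambda_G\theta/\mu$.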


Furthermore, Algorithm 1 is guaranteed to provide the global optimum of the
convex program in Eq. 9, as stated in the following theorem:

Theorem 1: If the proximal parameter $\theta$ satisfies the condition
$\sigma_{\max}(\mathbf{D}^T \mathbf{D}) < \frac{1}{\theta}$, where
$\sigma_{\max}(\cdot)$ is the largest eigenvalue of a matrix, then
$\{\mathbf{A}_j, \mathbf{L}_j\}$ generated by Algorithm 1 for any value of the
penalty coefficient $\mu$ converges to the optimal solution
$\{\hat{\mathbf{A}}, \hat{\mathbf{L}}\}$ of Eq. 9 as $j \to \infty$.
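Putting the pieces together, Algorithm 1 can be sketched as the loop below. This is a minimal illustration under several assumptions that are not part of the report: $q = 2$, $\mathbf{Z}_0 = \mathbf{0}$, a fixed number of iterations in place of a convergence test, arbitrary default values for $\lambda_G$, $\lambda_L$, and $\mu$, and $\theta$ set to $0.9/\sigma_{\max}(\mathbf{D}^T\mathbf{D})$ so that the condition of Theorem 1 holds. All function and variable names are hypothetical.

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding (solution of Eq. 11)."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_group_row(R, a1, a2):
    """Lemma 1 with q = 2: row shrinkage followed by block shrinkage."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    S = np.maximum(norms - a1, 0.0) / np.maximum(norms, 1e-12) * R
    s = np.linalg.norm(S)
    return np.zeros_like(R) if s <= a2 else (s - a2) / s * S

def gjsr_l_admm(Y, D, labels, lam_g=0.1, lam_l=1.0, mu=1.0, n_iter=300):
    """Sketch of Algorithm 1; labels[p] is the class of dictionary atom p."""
    labels = np.asarray(labels)
    A = np.zeros((D.shape[1], Y.shape[1]))
    L = np.zeros_like(Y)
    Z = np.zeros_like(Y)                       # assumed initial multiplier
    theta = 0.9 / np.linalg.norm(D, 2) ** 2    # sigma_max(D^T D) < 1/theta
    for _ in range(n_iter):
        # Step 1 (Eq. 11): threshold the singular values of P_j.
        L = svt(Y - D @ A + Z / mu, lam_l / mu)
        # Step 2 (Eqs. 13-15): gradient T_j, then Lemma 1 for each class block.
        T = D.T @ (D @ A - (Y - L + Z / mu))
        R = A - theta * T
        A_new = np.empty_like(A)
        for c in np.unique(labels):
            rows = labels == c
            A_new[rows] = prox_group_row(R[rows], theta / mu, lam_g * theta / mu)
        A = A_new
        # Step 3: multiplier update.
        Z = Z + mu * (Y - D @ A - L)
    return A, L

# Synthetic check of the constraint Y = D A + L only (not of recovery quality).
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 8))
D /= np.linalg.norm(D, axis=0)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
A_true = np.zeros((8, 4))
A_true[1] = 1.0                                # one class-0 atom shared by 4 sensors
L_true = np.outer(rng.standard_normal(30), np.ones(4))   # rank-1 interference
Y = D @ A_true + L_true
A_hat, L_hat = gjsr_l_admm(Y, D, labels)
rel_residual = np.linalg.norm(Y - D @ A_hat - L_hat) / np.linalg.norm(Y)
```

Setting `lam_g = 0` reduces the loop to a JSR+L solver, and additionally holding `L` at zero gives the plain JSR case. How the split between the sparse and low-rank parts turns out depends on the weights $\lambda_G$ and $\lambda_L$, so the example only monitors the relative constraint residual.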

4.  Experimental  Results 

4.1  Data  Set  Description 

In this section, we present experimental results for a challenging
classification of transient acoustic signals produced by various impulsive
sources, such as detonations (e.g., launch and impact) of different kinds of
weapons. The signals were collected by ground-based tetrahedral microphone
arrays (hence there are 4 measurements for each data sample) at separate
locations and on different dates and times, at a sampling rate of 1001.6 Hz.
This classification is difficult because many factors affect the propagation of
signals from source to sensors. Significant noise and external interfering
signals may also be present in the data. In particular, we note several
challenges: 1) the long distances between receivers and sources; 2) the
presence of obstacles on the terrain and the nature of the ground; 3) the
various amplitudes of the sources; 4) the different times of day; and 5) the
meteorological conditions (e.g., cloud cover, wind, and humidity). Several
preliminary classification results have been reported using well-known
conventional classifiers, such as Markov switching vector autoregression
(MSVAR), Gaussian mixture model (GMM), support vector machine (SVM), hidden
Markov model (HMM), or the combination of SVM and HMM (termed SVM-HMM or SHMM).

The original data set contains a total of 7,420 acoustic signatures and their
associated metadata. A clean-up round that exhaustively investigated and
compared the true signal contents revealed 2,088 duplicate samples. Moreover,
some of the signatures in the data set are unusable for various reasons, such
as 1) the recording equipment for some signals has output from only one sensor;
2) some data files contain incomplete events or inaccurate information (e.g.,
the starting time of the segmentation is after its ending time); and 3) the
amplitudes of certain recorded signals are too small. In total, another 1,188
files are not used in our experiments; these files may be fixed and reused in
our future work.

The 4,144 cleaned and unique files are reported as acoustic signatures of 9
different weapons recorded at 19 experiment fields, in which the set of fields
{02, 03, 06, 11-15} (termed V1 fields) is categorized as desert vegetation, and
fields {01, 04, 05, 07-10, 16-19} (known as V2 fields) are classified as
grassland vegetation. In this report, we classify launches and impacts for 3
weapon types (namely, W1, W2, and W3), which account for a total of 2,950
samples. The availability of these data samples allows us to examine the effect
of recording environments on the classification accuracy. The data set used in
our experiments is summarized in Table 1, which shows the distribution of the
number of signatures for each location/weapon type combination.

Table 1 Data set summary

| Acoustic data | W1    | W2  | W3  | Total |
|---------------|-------|-----|-----|-------|
| Launch        | 1,719 | 106 | 107 | 1,932 |
| Impact        | 918   | 36  | 64  | 1,018 |
| Total         | 2,637 | 142 | 171 | 2,950 |

4.2  Comparison  Methods 

In  this  report,  we  compare  the  classification  results  of  transient  acoustic  signals 
using  12  different  methods: 

•  MSVAR:  Markov  switching  vector  autoregressive.25 

•  GMM:  Gaussian  mixture  model. 

•  SLR:  sparse  logistic  regression. 

•  HMM:  hidden  Markov  model. 

•  SVM: support vector machine classifier (reporting results on the concatenated
feature vector of the 4 sensors).

•  SVM-HMM (or SHMM): combined SVM and HMM model, where an HMM is
first developed to capture the temporal dynamics of the time series of transient
acoustic signals, and an SVM is then embedded to model the emission probabilities
for each HMM.26

•  JSR:  joint  sparse  representation  model. 

•  GJSR: group-and-joint sparse representation. This model is a simplification of
the GJSR+L model in which the low-rank interference term is set to zero throughout
the optimization procedure.

•  JSR+L: joint sparse representation with low-rank interference.

•  GJSR+L: group-and-joint sparse representation with low-rank interference
model. This is the most general model among the sparsity-based techniques.

•  DNN:  deep  neural  network. 

•  DBN:  deep  belief  network. 


As mentioned previously, among the classification methods above, MSVAR, GMM,
HMM, SVM, and SVM-HMM have been previously studied for the acoustic transients
data set. In addition to the sparsity-based representation techniques, we
also develop and compare the classification results of SLR as well as 2 deep learning
models, which are arguably among the most advanced discrimination techniques of
recent years.


Sparse logistic regression. The main idea of this model is to associate with all the
training data in the training dictionary $D \in \mathbb{R}^{N \times P}$ an appropriate
coefficient vector $a \in \mathbb{R}^{N}$, with the logistic loss taken over the sum of
training samples. Sparsity regularization can be incorporated into the optimization to
retrieve the coefficient vector $a$. This regularization implies that some features are
assumed to be irrelevant for classification; thus, their coefficients should be minimized
via the $\ell_1$-norm sparsity constraint. In particular, the probabilistic model of the
logistic regression27 is formulated as follows:

$$P(y_j \mid d_j) = \frac{\exp(d_j^T a \, y_j)}{1 + \exp(d_j^T a)}, \tag{19}$$

where the feature vector $d_j$ is the j-th column vector of the training dictionary $D$,
and the index $y_j$, taking values 0 or 1, indicates the class label.

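To recover $a$, maximum-likelihood estimation over all the training data is combined
with a sparsity penalty on $a$. In particular, the following problem is solved:

$$\min_{a} \; \mathcal{L}(a) + \lambda \|a\|_1, \tag{20}$$

where the loss function $\mathcal{L}(a)$ is the negative log-likelihood of the training
data under Eq. 19:

$$\mathcal{L}(a) = -\sum_{j=1}^{P} y_j \, d_j^T a + \sum_{j=1}^{P} \log\left(1 + \exp(d_j^T a)\right). \tag{21}$$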

Once the coefficient vector $a$ for each training sample is obtained, each measurement
of the test sample is then assigned to a class, and the final decision is made by
selecting the label that occurs most frequently.
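The optimization in Eqs. 19-21 can be illustrated with a short NumPy sketch. It is only a minimal illustration, not the solver used for the reported results: the proximal-gradient (ISTA) iteration, the step-size rule, and the names `fit_sparse_logistic`, `penalized_loss`, and `majority_vote` are all assumptions introduced here. Labels are binary (0 or 1) as in Eq. 19, and `D` holds one training sample per column.

```python
import numpy as np


def sigmoid(s):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * s))


def penalized_loss(D, y, a, lam):
    """Negative log-likelihood under Eq. 19 plus the l1 penalty of Eq. 20."""
    s = D.T @ a  # s[j] = d_j^T a for each training column d_j
    nll = np.sum(np.logaddexp(0.0, s)) - y @ s
    return nll + lam * np.sum(np.abs(a))


def fit_sparse_logistic(D, y, lam=0.1, n_iter=500):
    """Proximal-gradient (ISTA) iterations; an assumed solver, not the report's."""
    a = np.zeros(D.shape[0])
    # Step 1/L, where L = ||D||_2^2 / 4 bounds the curvature of the logistic loss.
    step = 4.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        grad = D @ (sigmoid(D.T @ a) - y)  # gradient of the negative log-likelihood
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return a


def majority_vote(labels):
    """Return the label that occurs most often among per-measurement decisions."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]
```

The soft-threshold step is what drives coefficients of irrelevant features exactly to zero; a sufficiently large `lam` returns the all-zero vector.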

Deep learning models. Recently, deep learning (or deep networks) has increasingly
attracted the interest of researchers in diverse disciplines and has
been applied to many fields such as computer vision, natural language processing,
automatic speech recognition, and bioinformatics. Remarkably, deep-learning-based
techniques have been shown to produce state-of-the-art results on various
discrimination tasks.28,29 However, it has been argued that in order to perform well, these
techniques normally require a very large and diversified training data set. Experiments
with deep learning architectures are critical in helping us comprehensively
compare the classification performance with state-of-the-art techniques.

In this report, we develop 2 deep learning architectures, namely, deep neural network
(DNN) and deep belief network (DBN), to perform classification on the transient
acoustic data set. A general deep network architecture is an artificial neural
network with a number of hidden layers between its inputs and its outputs (Fig. 4).
DNNs are feed-forward networks whose hidden layers can be hierarchically trained
by backpropagating derivatives of a cost function that measures the discrepancy
between the target outputs and the actual outputs produced for each training case.30
On the other hand, DBNs can be formed by stacking multiple layers of restricted
Boltzmann machines (RBMs), which are trained in a greedy fashion,31,32 and the
results are fine-tuned using backpropagation. An RBM is a generative undirected
graphical model that learns connections between a layer of visible units (representing
stochastic binary input data) and a layer of stochastic binary hidden units with
the restriction that no visible-visible or hidden-hidden units are connected.
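The RBM building block described above can be sketched in a few lines of NumPy. The report does not state which update rule was used to train each RBM, so the one-step contrastive divergence (CD-1) update below is an assumption, as are the function names, the learning rate, and the layer sizes; it is meant only to show how the bipartite restriction makes each layer conditionally independent given the other.

```python
import numpy as np


def sigmoid(x):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def hidden_probabilities(v, W, b_hid):
    """P(h = 1 | v); hidden units are independent given the visible layer."""
    return sigmoid(v @ W + b_hid)


def cd1_update(v0, W, b_vis, b_hid, lr=0.1, rng=None):
    """One CD-1 parameter update on a batch v0 of binary visible vectors (rows)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Up pass: sample binary hidden states from the data.
    p_h0 = hidden_probabilities(v0, W, b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass: reconstruct the visible layer, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = hidden_probabilities(v1, W, b_hid)
    # Difference between data-driven and reconstruction-driven statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```

In a greedy stacking scheme, the hidden probabilities of one trained RBM would serve as the visible data of the next before the whole stack is fine-tuned.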


Fig.  4  A  general  deep  network  architecture. 


4.3  Experimental  Setups 

4.3.1  Segmentation  and  Feature  Extraction 

To accurately perform classification, it is necessary to extract the actual events from
collected signals. A recorded signal can last several seconds. However, the event
occurs within a small interval, typically of half a second duration or less. After
segmentation, the cepstral features33 are extracted in each segment. Cepstral features
have been proven to be very effective in speech recognition and acoustic signal
classification. The power cepstrum of a signal $y(t)$, formulated as
$\left| F^{-1}\left( \log_{10}\left( |F(y(t))|^2 \right) \right) \right|^2$,
where $F$ is the Fourier transform,34 captures the rate of signal change with time with
respect to different frequency bands. We discard the first cepstral coefficient
(corresponding to the zero-frequency component) and use the next 50 coefficients for
classification.
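A minimal sketch of this feature extraction, assuming a discrete FFT stands in for the Fourier transform and that the segment is already cut out, might look as follows. The function name `power_cepstrum_features` and the small constant `eps` (added only to avoid taking the logarithm of zero) are assumptions, not part of the report.

```python
import numpy as np


def power_cepstrum_features(segment, n_coeffs=50, eps=1e-12):
    """Power cepstrum |IFFT(log10(|FFT(y)|^2))|^2, keeping coefficients 1..n_coeffs."""
    spectrum = np.fft.fft(np.asarray(segment, dtype=float))
    log_power = np.log10(np.abs(spectrum) ** 2 + eps)  # eps guards against log10(0)
    cepstrum = np.abs(np.fft.ifft(log_power)) ** 2
    # Drop the first coefficient and keep the next n_coeffs, as described in the text.
    return cepstrum[1:n_coeffs + 1]
```

For a half-second segment sampled near 1 kHz (about 500 samples), this yields a 50-dimensional feature vector per sensor.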

4.3.2  Training  and  Testing  Splits 

We evaluated classification performance under different training and testing splits:

•  Randomly splitting the whole data set into training and testing sets with an
equal number of samples (i.e., half for training and half for testing). This is the
most favorable setup, in which training samples from diversified battlefields are
available. The results of this setting should therefore provide an upper bound
on performance in real-world practical applications. Most of the published experiments
so far have been evaluated on this half-half splitting setup.

•  Splitting the training and testing sets by opposite vegetation environments.
This experimental setup consists of 2 different scenarios that perform classification
with samples of V1 vegetation (i.e., desert vegetation) as training
and V2 vegetation (i.e., grassland vegetation) as testing and vice versa. These
are viewed as pessimistic setups (corresponding to lower-bound results) since
training and testing samples are completely non-overlapping in date, time, location,
experiment field, and vegetation environment.
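The two splitting protocols above can be sketched as index selections. This is only an illustration under stated assumptions: the per-sample field numbers are taken to be available as integers from the metadata, the field sets are those listed in Section 4.1, and the function names and random seed are hypothetical.

```python
import numpy as np

# Field numbers as listed for the two vegetation categories.
V1_FIELDS = {2, 3, 6, 11, 12, 13, 14, 15}            # desert vegetation
V2_FIELDS = {1, 4, 5, 7, 8, 9, 10, 16, 17, 18, 19}   # grassland vegetation


def half_half_split(n_samples, seed=0):
    """Random split into two halves; returns (train_idx, test_idx)."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]


def vegetation_split(fields, train_fields):
    """Train on samples recorded at train_fields, test on all remaining samples."""
    train_mask = np.isin(np.asarray(fields), list(train_fields))
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)
```

Calling `vegetation_split(fields, V1_FIELDS)` gives the V1-as-training scenario, and passing `V2_FIELDS` gives the reverse one.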

4.3.3  Data  Imbalance 

As can be seen from Table 1, there is a severe class imbalance among the number
of samples of different projectiles. More specifically, the number of W1 samples
equals 89.4% of the total number of available signatures while W2 and W3 samples
account for only 4.8% and 5.8% of all signatures, respectively. This means that
there are at least 15 times more signatures available for W1 than for either of the
other 2 weapons. Moreover, while the imbalance between data samples of the 2 actions,
launch and impact, is less serious than that of weapon types, the collection
of launch signatures is still almost twice that of the impact signatures. In total, the
number of signatures in the largest class (i.e., the W1-launch class) is
approximately 48 times that of the smallest class (i.e., the W2-impact class).
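The figures quoted in this paragraph follow directly from the counts in Table 1, as the short check below shows. The dictionary layout and helper name are illustrative choices; only the counts come from the table.

```python
# Signature counts per (weapon, action) class, copied from Table 1.
COUNTS = {
    ("W1", "launch"): 1719, ("W1", "impact"): 918,
    ("W2", "launch"): 106, ("W2", "impact"): 36,
    ("W3", "launch"): 107, ("W3", "impact"): 64,
}
TOTAL = sum(COUNTS.values())


def weapon_total(weapon):
    """Total number of signatures (launch plus impact) for one weapon type."""
    return sum(n for (w, _), n in COUNTS.items() if w == weapon)


shares = {w: round(100.0 * weapon_total(w) / TOTAL, 1) for w in ("W1", "W2", "W3")}
launch_total = sum(n for (_, a), n in COUNTS.items() if a == "launch")
impact_total = TOTAL - launch_total
largest_to_smallest = max(COUNTS.values()) / min(COUNTS.values())
```

Here `shares` reproduces the 89.4%, 4.8%, and 5.8% proportions, the launch-to-impact ratio is about 1.9, and the largest-to-smallest class ratio is 1,719/36, or about 48.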

This imbalance is a critical problem that influences the classification results of all
methods, since a test sample is more likely to be assigned to a majority class, which
occupies more volume at higher density and contains more diversified information.
In particular, this problem seriously affects the results of sparsity-based techniques
because their class assignment steps are decided by the coefficient vectors, whose
sizes are unbalanced across the training classes; hence, their outputs are highly
asymmetric among classes. Therefore, for the 4 sparsity-based methods, we apply a
balancing step that brings the number of training samples in every class to the median
of all class sizes. This step shrinks the majority classes by random under-sampling to
the median size and expands the minority classes by random over-sampling to
the same class size. This balancing strategy, despite introducing over-fitting due to
repeated instances in minority classes and loss of information in majority classes,
provides more equally distributed classification accuracy rates among classes for
sparsity-based methods. On the other hand, data balancing may slightly vary
the performance of the other classification techniques but would not produce systematic
changes in the overall results. Therefore, in this report, all results generated by
sparsity-based models are obtained with balanced training samples while those
of the other methods are reported with the full data set.
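The balancing step can be sketched as follows. The function name, the seed, and two details are assumptions made for this illustration: a non-integer median (possible with an even number of classes) is truncated, and minority classes keep all of their original samples before extra copies are drawn with replacement.

```python
import numpy as np


def balance_to_median(labels, seed=0):
    """Return sample indices such that every class has the median class size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = int(np.median(counts))  # truncation of a fractional median is assumed
    picked = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        if n >= target:
            # Majority class: random under-sampling without replacement.
            chosen = rng.choice(idx, size=target, replace=False)
        else:
            # Minority class: keep every sample, then over-sample with replacement.
            extra = rng.choice(idx, size=target - n, replace=True)
            chosen = np.concatenate([idx, extra])
        picked.append(chosen)
    return np.concatenate(picked)
```

The returned indices would select the columns of the training dictionary used by the sparsity-based models, while the other classifiers would keep the full training set.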

4.4  Comparison  of  Classification  Performance 

In this section, we report extensive experimental results on the classification of
the transient acoustic data set for 2 main classification problems:

•  Four-class problem to classify W1-launch, W1-impact, W2-launch, and
W2-impact

•  Six-class problem to classify W1-launch, W1-impact, W2-launch, W2-impact,
W3-launch, and W3-impact

Note  that  a  W3  munition  is  a  special  type  of  W2.  Therefore,  in  the  4-class  problem, 
we  combine  the  signatures  of  W2  and  W3  weapon  types  and  just  consider  them  as 
W2  while  the  6-class  problem  further  discriminates  between  W2  and  W3  weapons. 

4.4.1  Classification  Results  for  4-class  Problem 

We now present the classification results for the 4-class classification problem,
which discriminates between the launches and impacts of W1 and
W2 projectiles. The detailed confusion matrices of all competing methods for the 3
splitting setups (1) half-half, 2) V1 as training and V2 as testing, and 3) V2 as
training and V1 as testing) are visualized in Figs. 5-7, respectively. The heights of
all bars are normalized with respect to the actual number of testing samples in that
class (i.e., each row is normalized to sum to unity). Furthermore, the black
number on the top of each bar shows the true number of samples that are classified
into the associated class while the red number displays the percentage of the number
of predicted samples over the total number of testing samples in that class. The
results on the diagonal of each matrix (surrounded by rectangles drawn with solid
red lines) show the correct classification rates of the 4 classes. Also, the results
from the 4 sparsity-based techniques are surrounded by rectangles drawn with solid
blue lines.

Fig. 5 Comparison of 4-class confusion matrices with random half-half separation of training and testing sets

Fig. 6 Comparison of 4-class confusion matrices, V1 as training and V2 as testing

Fig. 7 Comparison of 4-class confusion matrices, V2 as training and V1 as testing


Figures 8 and 9 summarize the classification performance of all comparison methods
for the 2 opposite-vegetation splitting setups: V1 as training and V2 as testing
(Fig. 8) and V2 as training and V1 as testing (Fig. 9). We report both weighted
results, which account for the actual class sizes, and non-weighted results, which
simply average the classification rates of the 4 classes without taking the class sizes
into account. In addition, the results of each opposite-vegetation splitting setup,
displayed as black percentage numbers, are compared with those of the optimistic
half-half setup, shown as red numbers. The classification rates of the half-half setting
consistently exceed those of the opposite-vegetation splits by around 10%.
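The two summary measures, together with the row normalization used for the confusion-matrix figures, can be computed as in the sketch below. The function name is an assumption, and any matrix passed to it in an example is a toy input rather than a result from this report.

```python
import numpy as np


def accuracy_summaries(confusion):
    """Rows are true classes, columns are predicted classes (raw sample counts)."""
    confusion = np.asarray(confusion, dtype=float)
    class_sizes = confusion.sum(axis=1)
    row_normalized = confusion / class_sizes[:, None]   # each row sums to unity
    per_class = np.diag(row_normalized)                 # per-class correct rates
    # Weighted: classes contribute in proportion to their sizes (overall accuracy).
    weighted = np.sum(per_class * class_sizes) / class_sizes.sum()
    # Non-weighted: plain average of the per-class rates.
    non_weighted = per_class.mean()
    return row_normalized, weighted, non_weighted
```

With a dominant class, the weighted figure is driven mostly by that class, which is why the non-weighted figure is reported alongside it.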


(a)  (b) 

Fig. 8 Comparison of classification performance of 4-class problem with V1 as training and V2
as testing vs. half-half split of training and testing sets: a) weighted results and b) non-weighted
results


(a)  (b) 

Fig. 9 Comparison of classification performance of 4-class problem with V2 as training and V1
as testing vs. half-half split of training and testing sets: a) weighted results and b) non-weighted
results


Both Figs. 8 and 9 show that sparsity-based techniques generally perform among
the best, with the GJSR+L model always yielding the best classification rates
for both weighted and non-weighted averages in all 3 splitting setups,
illustrating the efficacy of its group-and-joint sparsity structure as well as its low-rank
interference model. On the other hand, deep network architectures provide rates similar to
traditional classification techniques and are surpassed by GJSR+L by more than 6%
in weighted results and 10% in non-weighted results. Of the 2 deep learning
methods, DBN is slightly superior to DNN in all cases. The main reasons for the
moderate results of these techniques are the limited size and lack of diversity of the data
set as well as the harmful effects of noise and/or external interference, such as the
presence of propagation effects in recorded signals.

4.4.2  Classification  Results  for  6-class  Problem 

In this section, we further compare the performance of the 12 competing classification
techniques on the 6-class problem, which discriminates between the 2 detonation
events (launches and impacts) of 3 projectiles. The detailed confusion matrices for the 3
training-testing splitting scenarios are shown in Figs. 10-12, and their
weighted and non-weighted classification results are summarized in Figs. 13 and
14. We again observe a similar performance ordering among the competing classification
techniques, in which GJSR+L and other sparsity-based models are superior
to conventional classifiers as well as the 2 deep network architectures. In some
classes, the training data samples are even more limited than in the 4-class problem.
Specifically, no more than 25 training samples are available in each of the 4 classes
W2-launch, W2-impact, W3-launch, and W3-impact for the V2 as training and V1
as testing split, in which the W3-impact class contains only 9 training signatures.

In addition, we compare the classification rates of the 4-class and 6-class problems in
Fig. 15, where the results are averaged over the 3 splitting setups. The black numbers
display the classification rates (in percent) for the 6-class problem
while the red numbers display those of the 4-class case. The performance discrepancies
between the 2 problems vary by method but are typically small,
staying in the range of 2%-4%. This suggests that W3 signatures contain features
that discriminate them from the signals of other W2-type weapons and hence can be
categorized in a separate class.

Fig. 10 Comparison of 6-class confusion matrices with random half-half separation of training and testing sets

Fig. 11 Comparison of 6-class confusion matrices, V1 as training and V2 as testing

Fig. 12 Comparison of 6-class confusion matrices, V2 as training and V1 as testing



(a)  (b) 

Fig. 13 Comparison of classification performance of 6-class problem with V1 as training and
V2 as testing vs. half-half split of training and testing sets: a) weighted results and b)
non-weighted results


(a)  (b) 

Fig. 14 Comparison of classification performance of 6-class problem with V2 as training and
V1 as testing vs. half-half split of training and testing sets: a) weighted results and b)
non-weighted results




(a)  (b) 

Fig. 15 Comparison of classification results of 4-class problem vs. 6-class problem: a)
non-weighted results and b) weighted results


4.5 Discussions on Classification Strengths and Weaknesses

Section 4.4 has provided a comprehensive comparison of the performance of the
12 classification methods. In this section, we discuss general strengths and
weaknesses of the classifiers under examination, focusing on both performance and
complexity. From the performance perspective, the results from the previous section
demonstrate the superior performance of sparsity-based models. On the negative
side, however, these methods are expensive in both training and testing procedures.
Furthermore, the results are highly dependent on the weighting parameters, such as
$\lambda_L$ and $\lambda_G$, which encode low-rank and group-structure information.

Table  2  summarizes  the  advantages  and  disadvantages  in  terms  of  complexity  and 
classification  accuracy  of  the  12  classification  methods.  This  does  not  compare 
methodology  in  general;  instead,  the  classifiers  are  only  evaluated  based  on  their 
specific  performance  on  the  transient  acoustic  data  set. 

Table 2 Advantages and disadvantages of the competing methods

| Methods | Strengths | Weaknesses | Other comments |
|---------|-----------|------------|----------------|
| MSVAR | Quite low computational time for testing. | Requires multiple predefined parameters of transition stages. | MSVAR was previously studied and is a good reference for performance comparison. |
| GMM | (i) Simple and easy to implement; (ii) low computation and small memory allocation. | Requires predefining the number of mixtures of the model. | GMM is a simple model that can yield generally good results. |
| SLR | Training time is generally fast. | Classification accuracy is quite low. | This method is moderate in both performance and computation. |
| SVM | Supports kernels, hence can model nonlinear relations. | Expensive in both training/testing time and memory usage. | This is a typical deterministic, discriminative model. |
| HMM | Low complexity in both running time and memory usage. | Requires predefining the number of states and number of mixtures in each state. | This is a typical generative, probabilistic model. |
| SVM-HMM | (i) Good classification performance; (ii) testing time is generally fast. | Long training time. | Performance is second best after sparsity-based methods while requiring much less computational time. |
| Sparsity-based methods | (i) Highest classification performance among competing methods; (ii) robust to noise/interference; (iii) low memory allocation for testing. | (i) Very long testing time; (ii) results are dependent on parameter selection; (iii) require extensive cross-validation computation for parameter learning. | The 4 sparsity-based methods vary slightly in classification results and computational complexity: JSR has the lowest classification rates but is least dependent on parameter selection, while GJSR+L performs the best but is most dependent on parameter selection. |
| Deep network architectures | Fast testing time. | (i) Long training procedure; (ii) require large training set. | Deep learning methods do not perform very well on the available data set. However, the performance may significantly improve if many more training samples are available. |

Furthermore, we report the computational complexity of the 12 methods at test time
for the 4-class problem and half-half split in Table 3. In particular, Table 3
lists the running time in milliseconds and the maximum memory allocation in
kilobytes, measured specifically for the classification function applied to one testing
sample. All the calculations are conducted and timed on the same desktop with an Intel
quad-core 3.60-GHz CPU and 16 GB of memory, running Windows 7 and Matlab
version 8.3.0. Note that these running times depend on the parameters chosen
(e.g., number of states, correlation order). A detailed study of speed for each
algorithm is outside the scope of this report. In addition, it might not provide the final
word in the context of continuing hardware progress. It is also worth mentioning
that these numbers are for the testing side, after data training and parameter learning
have already been completed. Moreover, while the training complexity is difficult to
quantify, a general analysis is provided in Table 2.
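A per-sample measurement of this kind can be sketched with the Python standard library. The reported numbers were obtained in Matlab, so the harness below is only an analogous illustration: `profile_classifier`, its arguments, and the use of `tracemalloc` (which tracks Python-level allocations only) are assumptions.

```python
import time
import tracemalloc


def profile_classifier(classify, sample, n_runs=20):
    """Return (milliseconds per call, peak traced memory in KB) for classify(sample)."""
    # Timing pass, run without the memory tracer so it does not slow the calls down.
    start = time.perf_counter()
    for _ in range(n_runs):
        classify(sample)
    ms_per_sample = 1000.0 * (time.perf_counter() - start) / n_runs
    # Memory pass: peak allocation observed during a single call.
    tracemalloc.start()
    classify(sample)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return ms_per_sample, peak_bytes / 1024.0
```

Any callable that maps one feature vector to a label can be passed as `classify`, which keeps training cost out of the measurement, in line with the testing-side figures discussed above.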

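Table 3 Computational complexity and memory usage comparison

| Methods | Running time per testing sample (ms) | Maximum memory allocation (KB) |
|---------|--------------------------------------|--------------------------------|
| MSVAR   | 18.2  | 29,742  |
| GMM     | 9.4   | 5,566   |
| SLR     | 55.4  | 128,262 |
| HMM     | 5.2   | 11,506  |
| SVM     | 181.2 | 220,386 |
| SVM-HMM | 20.7  | 116,714 |
| JSR     | 938   | 2,096   |
| GJSR    | 1,062 | 2,122   |
| JSR+L   | 1,286 | 2,138   |
| GJSR+L  | 1,308 | 2,264   |
| DNN     | 0.7   | 4,838   |
| DBN     | 0.6   | 6,972   |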

5.  Conclusions  and  Future  Work 

In this technical report, we have proposed a general sparsity-based framework for
the classification of transient acoustic signals; this framework enforces various
sparsity structures, such as joint-sparse or group-and-joint-sparse, within measurements of
multiple acoustic sensors. We further robustify our models to deal with the presence
of dense and large but correlated noise and signal interference (i.e., low-rank
interference). Another contribution is the implementation of deep learning architectures
to classify the transient acoustic data set. Extensive experimental results
are included in the report to compare the classification performance of sparsity-based
and deep-network-based techniques with conventional classifiers, such as
MSVAR, GMM, SVM, HMM, SLR, and the combined SVM-HMM method, for
2 experimental sets of 4-class and 6-class classification problems. Based on relative
performance and overall computational requirements, we would pick JSR
as a stop-gap solution. Its performance is among the leading pack while its training
and testing time is moderate. However, the reality is that all classifiers drop around
10% in accuracy when tested on unseen environments. This is clearly a fundamental
problem that is still open. Much work remains to be done, as is discussed next.

Approved for public release; distribution is unlimited.

In the future, we plan to work on 1) searching for invariant features in the z-domain;
2) developing a collaborative multi-array multi-sensor classification framework,
which takes into consideration the correlations as well as complementary information
among data samples of a single event recorded by multiple sensor arrays
at different locations; 3) learning a dictionary instead of using an off-the-shelf
dictionary to improve discrimination characteristics; 4) using online/on-the-fly dictionary
updates; and 5) studying unsupervised transfer learning methods to exploit available
information from unlabeled data samples and thus further improve classification
results. The detailed approaches are presented in Subsections 5.1-5.5. Furthermore,
other tasks for future investigation include 6) re-running experiments
on the full and clean data set; 7) developing more robust deep network methods; 8)
learning about other invariant feature spaces, such as the symbolic dynamic filtering
features35 that can effectively encapsulate time-series information; and 9) publishing
results in journals and/or presenting them at conferences in the acoustics field.

5.1  Invariant  Features  Search  in  the  Z-Domain 

A  fundamental  obstacle  to  long-range  classification  is  the  effect  of  the  propagation 
channel  on  the  acoustic  signal.  Under  the  linear  regime,  it  is  reasonable  to  assume 
that  the  channel  is  linear  time-invariant.  We  can  further  assume  that  the  signals 
themselves,  after  propagating  far  enough  from  the  point  of  explosion,  are  small 
enough  to  be  in  the  linear  regime  and  can  be  modeled  by  a  linear  time-invariant 
system.

Under these assumptions, the measurements can be modeled in the z-domain as

$$X(z) = H(z)M(z) + N(z) = \frac{\prod_{j} (z - x_j) \prod_{j} (z - z_j)}{\prod_{j} (z - q_j) \prod_{j} (z - p_j)} + N(z), \tag{22}$$

where $H(z)$ is the channel, $M(z)$ is the signal of interest, and $N(z)$ is the additive
noise. $x_j$ and $q_j$ are the zeros and poles of $H(z)$, and $z_j$ and $p_j$ are the zeros and
poles of $M(z)$.

Clearly, the “natural” invariant features are the poles and zeros of M(z), as any
H(z) will leave them unchanged, as long as there is no pole-zero cancellation. To
find them, one will need to remove the poles/zeros of H(z) and N(z), effectively
factoring out channel and signal. The answer is not obvious if one only measures
X(z). However, it is conceivable to use past noise measurements and multiple
measurements of the same signal to estimate M(z).
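The invariance claim can be checked numerically in the noise-free case, i.e., with N(z) = 0. In the sketch below, all pole and zero locations are hypothetical values chosen for illustration, and the helper names are assumptions; the point is only that the roots of M(z) reappear unchanged among the roots of the cascade H(z)M(z) when no pole-zero cancellation occurs.

```python
import numpy as np

# Hypothetical pole/zero locations (illustrative only, not measured values).
M_ZEROS, M_POLES = [0.5, -0.4], [0.8, -0.6]   # signal of interest M(z)
H_ZEROS, H_POLES = [0.2], [0.3, 0.1]          # propagation channel H(z)


def cascade_roots(h_zeros, h_poles, m_zeros, m_poles):
    """Zeros and poles of H(z)M(z), found from the expanded polynomials."""
    num = np.polymul(np.poly(h_zeros), np.poly(m_zeros))
    den = np.polymul(np.poly(h_poles), np.poly(m_poles))
    return np.roots(num), np.roots(den)


def contains_all(found, expected, tol=1e-6):
    """True if every expected root lies within tol of some root in found."""
    found = np.asarray(found)
    return all(np.min(np.abs(found - r)) < tol for r in expected)


zeros_x, poles_x = cascade_roots(H_ZEROS, H_POLES, M_ZEROS, M_POLES)
signal_roots_survive = contains_all(zeros_x, M_ZEROS) and contains_all(poles_x, M_POLES)
```

The difficulty noted in the text remains: from the roots of the cascade alone, nothing indicates which roots belong to the channel and which to the signal.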

A simple approach is to estimate all the poles and zeros of X(z) and compare the
pole-zero  pattern  to  known  ones,  using  the  known  patterns  as  templates.  There  are 
many  robust  approaches  to  solving  for  the  poles  and  zeros  of  X(z).  One  can  use 
the  state-space  realization  from  noisy  impulse  response.36  Another  option  is  to  use 
the  system  identification  approach.  We  have  already  implemented  and  tested  the 
first  approach.  However,  to  keep  the  focus,  we  will  report  our  preliminary  results  in 
another  paper. 
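To make the first option more concrete, the sketch below recovers poles from an impulse response with a Hankel-matrix and SVD realization, in the style of the eigensystem realization algorithm. It is an assumed stand-in for illustration and is not claimed to be the method of the cited reference or the implementation tested by the authors; the function name and the convention that `h[0]` is the direct term are also assumptions. The model order must be supplied by the user, which reflects the order-selection difficulty of this approach.

```python
import numpy as np


def poles_from_impulse_response(h, order):
    """Estimate system poles from impulse-response samples h[0], h[1], ...

    h[0] is treated as the direct (feedthrough) term and is not used.
    """
    h = np.asarray(h, dtype=float)
    m = (len(h) - 1) // 2
    # Hankel matrices: H0[i, j] = h[i + j + 1] and its one-step shift H1.
    H0 = np.array([[h[i + j + 1] for j in range(m)] for i in range(m)])
    H1 = np.array([[h[i + j + 2] for j in range(m)] for i in range(m)])
    U, s, Vt = np.linalg.svd(H0)
    U, s, Vt = U[:, :order], s[:order], Vt[:order, :]   # keep the dominant part
    scale = np.diag(1.0 / np.sqrt(s))
    A = scale @ U.T @ H1 @ Vt.T @ scale                 # realized state matrix
    return np.linalg.eigvals(A)                         # its eigenvalues are the poles
```

With noisy data, the truncation of the SVD to `order` singular values acts as the denoising step, and the estimated poles become sensitive to that choice.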

If  accurate,  a  practical  by-product  is  a  pole-zero  pattern  that  can  be  displayed  to 
human  operators,  who  could  then  use  their  own  judgment  to  do  the  classification. 
The  challenge  with  this  approach  is  the  order  selection  and  unstable  realization. 

5.2  Collaborative  Multi-array  Multi-sensor  Classification 

In this report, we have theoretically and empirically demonstrated the effectiveness
of incorporating correlation across different sensors attached to the same sensor
array. In practice, multiple sensor arrays are stationed at different locations,
concurrently listening to detonation events; hence, this information can be exploited
to further improve classification performance. In other words, we exploit not only
correlation among sensors of the same array, but also complementary information
across different sensor arrays. Mathematically, suppose there are K sensor arrays
in the system, $Y^k$ is the corresponding measurement matrix collected by the k-th
array (k = 1, 2, ..., K), and $Y = [Y^1, Y^2, ..., Y^K]$ is the concatenated matrix of
all sensors and arrays. The GJSR+L model can then be extended to fuse multiple
arrays for a collaborative classification as follows:


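$$\min_{A, L} \; \sum_{k=1}^{K} \|A^k\|_{1,q} + \lambda_G \sum_{k=1}^{K} \sum_{c=1}^{C} \|A_c^k\|_F + \lambda_L \|L\|_* \quad \text{s.t.} \quad Y = DA + L, \tag{23}$$

or equivalently

$$\min_{A, L} \; \sum_{k=1}^{K} \|A^k\|_{1,q} + \lambda_G \sum_{c=1}^{C} \|A_c\|_F + \lambda_L \|L\|_* \quad \text{s.t.} \quad Y = DA + L, \tag{24}$$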


where  A  =  [Ai;  A2; ...;  Ac]  is  the  row-concatenated  coefficient  matrix  of  different 
classes,  A  =  [A1,  A2,  ...,AK]  is  the  column-concatenated  coefficient  matrices  of 
K  sensor  arrays,  and  A1;:  is  the  coefficient  submatrix  associated  with  class  c  and 
array  k.  The  minimization  of  the  first  component  in  Eq.  24  can  be  phrased  as  en¬ 
forcing  the  row-sparsity  property  within  each  sensor  array  while  different  arrays 
do  not  necessarily  share  common  sparsity  patterns.  On  the  other  hand,  the  group- 
sparsity  function  (i.e.,  the  second  term  in  Eq.  24)  promotes  a  group  structure  within 
each  array  as  well  as  across  multiple  arrays.  Finally,  a  nuclear  norm  minimization 
is  devoted  to  L  to  encourage  low-rank  property  on  the  interference  among  all  mea¬ 
surements. 
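
To make the role of each penalty concrete, the following minimal NumPy sketch evaluates
the three terms discussed above for given coefficient blocks: the row-sparsity norm
summed over arrays, the per-class group norm taken across all arrays, and the nuclear
norm of L. It assumes q = 2; the function name multi_array_penalty, the atom_class
label vector, and the list-of-blocks layout are illustrative assumptions and not part
of the implementation used in this report.

```python
import numpy as np

def multi_array_penalty(A_blocks, L, atom_class, lam_G, lam_L):
    """Evaluate row-sparsity + group-sparsity + nuclear-norm penalties.

    A_blocks   : list of K coefficient matrices, one per sensor array
                 (all with one row per dictionary atom)
    L          : low-rank interference matrix
    atom_class : integer class label of each dictionary atom (hypothetical layout)
    """
    # Row-sparsity term: l_{1,2} norm of each array's block (q = 2 assumed).
    row_term = sum(np.linalg.norm(Ak, axis=1).sum() for Ak in A_blocks)

    # Group term: one Frobenius norm per class, taken across all arrays.
    A = np.hstack(A_blocks)
    group_term = sum(np.linalg.norm(A[atom_class == c])
                     for c in np.unique(atom_class))

    # Nuclear norm of L (sum of its singular values).
    nuclear_term = np.linalg.norm(L, ord="nuc")
    return row_term + lam_G * group_term + lam_L * nuclear_term
```

A solver would minimize this quantity subject to the data constraint; the sketch only
evaluates it, which is useful for monitoring convergence or comparing candidate solutions.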


5.3  Dictionary  Learning  for  Sparse  Coding  on  Acoustic  Signals 

The sparsity-based techniques proposed in this report are based on the assumption
that acoustic signatures usually lie in low-dimensional subspaces of a deterministic
dictionary, which is constructed by directly concatenating the acquired training
samples. However, learning the dictionary instead of using off-the-shelf training
samples is likely to improve classification performance. This means that a learning
procedure is added to the training side, where not only is a sparsity constraint
enforced on the coefficient matrix, but a dictionary is also learned in parallel to
increase sparsity and better capture the discriminative properties of the classes.
This learned dictionary is then used to classify test samples in place of the
deterministic dictionary. Given the training data $Y = [y_1, y_2, \ldots]$, a general
dictionary learning (DL) method is designed to simultaneously learn a dictionary $D$
and the corresponding sparse coefficients $A$ as follows:37

$$\min_{D,A} \; \frac{1}{2}\|Y - DA\|_F^2 + \lambda \|A\|_1. \tag{25}$$

The nonconvex optimization problem in Eq. 25 is usually solved by iterating between
sparse coding and dictionary updating. In the sparse coding stage, the sparse
coefficient matrix $A$ is found with respect to a fixed dictionary $D$. In the
dictionary updating stage, each dictionary atom $d_j$ in $D$ is updated using only
the data with nonzero sparse coefficients at index $j$. This subproblem can be solved
by either block coordinate descent37 or singular value decomposition.38 Furthermore,
we propose to design the dictionary and sparse code with more discriminating
properties by enforcing extra structural constraints $f_A(\cdot)$ and $f_D(\cdot)$,
leading to

$$\min_{D,A} \; \frac{1}{2}\|Y - DA\|_F^2 + \lambda_A f_A(A) + \lambda_D f_D(D), \tag{26}$$

where the structural-sparsity promoting function $f_A(\cdot)$ enforces the correlation
along multiple measurements and can be an element-wise-sparse, joint-sparse, or
group-sparse function as previously discussed in this report; and $f_D(\cdot)$ forces
the subdictionaries of different classes to be as incoherent as possible.39 For
simplicity, the presence of low-rank interference as well as the incorporation of
multiple sensor arrays are omitted from the model description in Eq. 26.
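
The alternating scheme described above can be sketched as follows. This is a minimal
illustration of the plain problem in Eq. 25 only (no structural terms), assuming ISTA
for the sparse coding stage and a unit-norm block-coordinate atom update; the function
names, iteration counts, and random initialization are assumptions and not the
implementation used in this report.

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise shrinkage, the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def sparse_code(Y, D, lam, n_iter=100):
    """ISTA for min_A 0.5*||Y - D A||_F^2 + lam*||A||_1 with D fixed."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    A = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - step * D.T @ (D @ A - Y), step * lam)
    return A

def update_dictionary(Y, D, A, eps=1e-12):
    """Block coordinate descent over atoms; each atom is kept at unit l2 norm."""
    D = D.copy()
    R = Y - D @ A  # current residual
    for j in range(D.shape[1]):
        if not np.any(A[j]):
            continue  # atom j is unused by every sample, leave it unchanged
        R += np.outer(D[:, j], A[j])  # residual with atom j removed
        d = R @ A[j]                  # best direction for atom j
        nrm = np.linalg.norm(d)
        if nrm > eps:
            D[:, j] = d / nrm
        R -= np.outer(D[:, j], A[j])  # put the (updated) atom back
    return D

def learn_dictionary(Y, n_atoms, lam=0.1, n_outer=20, seed=0):
    """Alternate sparse coding and dictionary updating."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(Y, D, lam)
        D = update_dictionary(Y, D, A)
    return D, sparse_code(Y, D, lam)
```

Restricting each atom to unit norm removes the scale ambiguity between D and A; other
normalizations (for example, a unit-ball constraint) are equally plausible choices.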

5.4  Online  Dictionary  Learning  Update 

In the long run, it is desirable to develop a system that can automatically update
the dictionary as more and more training samples (both labeled and unlabeled) are
continuously collected on the battlefield. For the sparsity models with deterministic
dictionaries, the dictionary update procedure typically includes 2 steps: the first step
selectively adds dictionary atoms collected in the field and labels them with the
corresponding classes, and the second step relearns the models' parameters via
cross-validation. This normally requires heavy computation, especially when the
training set is large, and is thus impractical to perform in real time.
Therefore, we propose a dynamic dictionary updating framework based on a DL
approach that can capture the representation and the label of the signal on the fly.

Researchers have proposed to update dictionaries online by block-coordinate descent
methods with warm restarts,40 recursive least squares,41 or an efficient feed-forward
architecture.42 Another related line of research is learning algorithms on
manifolds for independent component analysis. Researchers have developed efficient
conjugate gradient or steepest descent algorithms leveraging differential geometry
techniques.43 Our proposed online dictionary update may include several
approaches to remove the outdated dictionary atoms in a chosen subdictionary or
add new dictionary atoms to capture more on-the-fly features. First, a linear
dynamic system (LDS) model describing the signal evolution can be used to capture
the transformation of the dictionary. This quadratic problem can be formulated as
follows:

$$\min_{D_t} \; \frac{1}{2}\|Y_t - D_t A_t\|_F^2 + \gamma \|D_t - B_t D_{t-1}\|_F^2 + \lambda_D f_D(D_t), \tag{27}$$

where $D_t$ is the updated dictionary at time $t$, $B_t$ is used to capture the subspace
deformation, and $\gamma$ is a trade-off weighting parameter to balance the 2 terms. Here
we only change the dictionary updating stage in the DL model in Eq. 26, while the
sparse coding stage remains unchanged as described in Section 5.
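
As an illustration, if the structural term on the dictionary is dropped and the
deformation matrix is assumed known, the LDS-regularized dictionary update has a
closed-form solution obtained by setting the gradient to zero. The sketch below assumes
a weight of 0.5 on the data-fit term and gamma on the dynamics term; the function name
and argument layout are hypothetical.

```python
import numpy as np

def lds_dictionary_update(Y_t, A_t, D_prev, B_t, gamma):
    """Minimize 0.5*||Y_t - D A_t||_F^2 + gamma*||D - B_t D_prev||_F^2 over D.

    Setting the gradient to zero gives
        D (A_t A_t^T + 2*gamma*I) = Y_t A_t^T + 2*gamma * B_t D_prev.
    """
    n_atoms = A_t.shape[0]
    lhs = A_t @ A_t.T + 2.0 * gamma * np.eye(n_atoms)  # symmetric, PD for gamma > 0
    rhs = Y_t @ A_t.T + 2.0 * gamma * (B_t @ D_prev)
    # Solve D @ lhs = rhs by transposing (lhs is symmetric).
    return np.linalg.solve(lhs, rhs.T).T
```

A large gamma keeps the new dictionary close to the propagated previous one, while a
small gamma lets the newly collected data dominate the update.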

Another approach is to update the dictionary by descent on the oblique manifold.44
The oblique manifold is the set of matrices whose columns have unit Euclidean norm,
which is exactly the constraint typically enforced on the dictionary atoms in DL. We
propose to employ the gradient on the oblique manifold to dynamically update the
dictionary, where the gradient can be derived as

$$\nabla f(D_{t+1}) = P(f'(D_t)) = f'(D_t) - D_t \,\mathrm{ddiag}(D_t^T f'(D_t)). \tag{28}$$

Here, $P(\cdot)$ denotes the orthogonal projection onto the tangent space of the
dictionary manifold, $f(\cdot)$ represents the loss function, and $\mathrm{ddiag}(\cdot)$
sets all off-diagonal entries of a matrix to zero. Notice that in our problem setup,
the structure usually indicates the boundary of subspaces with different labels.
Therefore, we will further incorporate this information into the dictionary update
procedure to improve performance.
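
The projection in Eq. 28 can be sketched in a few lines of NumPy. The example below
assumes the dictionary has unit-norm columns, takes a single gradient step, and then
renormalizes the columns to return to the manifold; the renormalization (a simple
retraction), the step size, and the function names are assumptions made for
illustration.

```python
import numpy as np

def oblique_riemannian_grad(D, G):
    """Project the Euclidean gradient G onto the tangent space at D.

    Implements G - D ddiag(D^T G): column j becomes g_j - d_j (d_j^T g_j).
    """
    return G - D * np.sum(D * G, axis=0)

def oblique_step(D, G, step):
    """One descent step followed by column renormalization (retraction)."""
    D_new = D - step * oblique_riemannian_grad(D, G)
    return D_new / np.linalg.norm(D_new, axis=0)
```

For the data-fit loss 0.5*||Y - D A||_F^2, the Euclidean gradient passed in as G would
be (D A - Y) A^T.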

5.5  Unsupervised  Transfer  Learning 

One more approach that we plan to investigate in the future is unsupervised
transfer learning, which deals with the problem of automatically labeling newly
collected signals and then using them as new training samples for the classification
of acoustic signals.45,46 This is a critical problem since, in practice, many more
acoustic samples can be collected in the field than can be manually labeled. The
question is how to make use of these unlabeled samples to further improve
classification performance. This motivates us to study unsupervised transfer
learning, which takes into account the prior knowledge of labeled training examples
(source tasks) in a transformed domain to develop a hypothesis for the set of
unlabeled samples (target tasks); that is, probabilistically assigning every
unlabeled sample to a specific class. This can be done by using a transformation
matrix to map both the source and target data onto a common subspace in which each
target datum can be linearly reconstructed from the data in the source domain. This
problem can be formulated as

$$\min_{T,A} \; \|T Y_U - T Y_L A\|_F^2, \tag{29}$$

where $Y_L$ and $Y_U$ are the labeled (source) and unlabeled (target) measurement
matrices, respectively; $T$ is the linear transformation; and $A$ is the coefficient
matrix that linearly represents target tasks by the source tasks in the common
subspace. Furthermore, the reconstruction coefficient matrix $A$ should have a
block-wise structure that promotes both low-rank and sparsity properties on $A$.
Therefore, the unsupervised transfer learning for classification can be recast as the
following optimization:

$$\min_{T,A} \; \frac{1}{2}\|T Y_U - T Y_L A\|_F^2 + \lambda_1 \|A\|_1 + \lambda_L \|A\|_*. \tag{30}$$

The low-rank constraint, penalized by the nuclear norm $\|A\|_*$, encourages data
correlation among samples of the same class, while the sparsity constraint $\|A\|_1$
helps preserve the local structure of the data such that each target sample can be
well reconstructed by only a few samples from the source domain. The benefits
of solving the transfer learning problem (Eq. 30) are 2-fold. First, it efficiently
exploits information in the unlabeled signatures and can thus automatically update
the dictionary on the battlefield. Second, by transfer learning, we can enforce the
consistency of source and target samples in the transferred domain even when there
are features not shared by the 2 sets (such as propagation effects, signal
interferences, or vegetation). This may eventually lead to a possible answer to the
invariant feature search problem for transient acoustic signals, which still remains
unsolved.
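
One common way to handle the two nonsmooth penalties in a problem of this form (for
example, inside an ADMM or proximal gradient solver) is through their proximal
operators: element-wise soft thresholding for the sparsity term and singular value
thresholding for the nuclear norm. The sketch below shows only these two building
blocks and does not solve Eq. 30; the function names are illustrative.

```python
import numpy as np

def prox_l1(X, tau):
    """Proximal operator of tau * ||X||_1: element-wise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def prox_nuclear(X, tau):
    """Proximal operator of tau * ||X||_*: shrink the singular values by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Soft thresholding zeroes small entries (promoting sparsity), while singular value
thresholding zeroes small singular values (promoting low rank).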

List of Symbols, Abbreviations, and Acronyms

ADMM      alternating direction method of multipliers
ARL       US Army Research Laboratory
DBN       deep belief network
DL        dictionary learning
DNN       deep neural network
GJSR      group-and-joint sparse representation
GJSR+L    group-and-joint sparse representation with low-rank interference
GMM       Gaussian mixture model
HMM       hidden Markov model
JSR       joint sparse representation
MSVAR     Markov switching vector auto-regression
RBM       restricted Boltzmann machine
SLR       sparse logistic regression
SVM       support vector machine

Approved  for  public  release;  distribution  is  unlimited 




1  DEFENSE  TECHNICAL 
(PDF)  INFORMATION  CTR 

DTIC  OCA 

2  DIRECTOR 

(PDF)  US  ARMY  RESEARCH  LAB 
RDRL  CIO  LL 

IMAL  HRA  MAIL  &  RECORDS  MGMT 

1  GOVT  PRINTG  OFC 
(PDF)  A  MALHOTRA 




ABERDEEN  PROVING  GROUND 


11  DIR  USARL
(PDF)  RDRL  SES  P 

M  SCANLON 
S  TENNEY 
L  SIM 

TD  TRAN-LUU 
C  REIFF
H  VU 
D  GONSKI 
J  GOLDMAN 
W  ALBERTS 
RDRL  SES  A 
T  PHAM 
RDRL  Cl 
L  SOLOMON 


1  MBO  PARTNERS 
(PDF)  M  DAO 




